There is an infinite number of objects in outer space, and some of them are closer than we think. A distance of 70,000 km may not sound threatening, but at an astronomical scale it is very small, and an object at that distance can disrupt many natural phenomena.
These objects, or asteroids, can therefore prove to be harmful. Hence, it is wise to know what is surrounding us and which of those objects could hurt us.
The objective of the analysis is to discover whether an asteroid is potentially dangerous for the Earth or not.
To do that, we have different Machine Learning algorithms at our disposal for binary classification.
The dataset analyzed here is taken from Kaggle and contains more than 90,000 observations regarding about 27,000 objects in space.
Three statistical models for classification are applied:
-Probit Regression
-Logistic Regression (with Logit link function)
-Complementary log-log regression (cloglog)
The parameters of the models are tuned with Markov Chain Monte Carlo (MCMC) methods, and an in-depth, fully Bayesian analysis is applied to them.
The analysis is divided into four parts.
Now, let's move on.
| ID | Name | MinDiameter | MaxDiameter | RelativeVelocity | MissDistance | OrbitingBody | SentryObject | AbsoluteMagnitude | Hazardous |
|---|---|---|---|---|---|---|---|---|---|
| 2162635 | 162635 (2000 SS164) | 1.198 | 2.679 | 13569.25 | 54839744 | Earth | False | 16.73 | False |
| 2277475 | 277475 (2005 WK4) | 0.266 | 0.594 | 73588.73 | 61438127 | Earth | False | 20.00 | True |
| 2512244 | 512244 (2015 YE18) | 0.722 | 1.615 | 114258.69 | 49798725 | Earth | False | 17.83 | False |
| 3596030 | (2012 BV13) | 0.097 | 0.216 | 24764.30 | 25434973 | Earth | False | 22.20 | False |
| 3667127 | (2014 GE35) | 0.255 | 0.570 | 42737.73 | 46275567 | Earth | False | 20.09 | True |
The dataset is composed of 10 variables:
-ID: the ID number of each asteroid
-Name: the name of the asteroid
-MinDiameter: the estimated minimum diameter of the asteroid (in km)
-MaxDiameter: the estimated maximum diameter of the asteroid (in km)
-RelativeVelocity: the velocity relative to the Earth (in km/h)
-MissDistance: the distance between the Earth and the asteroid (in km)
-OrbitingBody: the planet that the asteroid orbits
-SentryObject: (binary variable) whether the asteroid is included in an automated collision monitoring system
-AbsoluteMagnitude: the intrinsic luminosity of the asteroid (in vmag). The lower the magnitude, the brighter the asteroid
-Hazardous: Boolean feature that indicates whether the asteroid is harmful or not. It is the output variable, and the analysis will be based on its prediction.
Of course, before analyzing the data, I need to clean them in order to remove useless information.
-1. ID and Name are not informative, so I will remove them. OrbitingBody and SentryObject also have the same value in every row, so I can remove them too.
-2. The output Hazardous is a Boolean variable (True/False), so I will transform it into \(0\) and \(1\).
After observing that OrbitingBody is composed \(100\%\) of 'Earth' and that SentryObject is always False, I dropped these columns.
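A minimal sketch of these cleaning steps (the file name `neo.csv` is an assumption, and the column names follow the table shown above):

```r
# Read the raw Kaggle file (file name is an assumption)
data <- read.csv("neo.csv")

# 1. Drop identifiers and the constant columns
drop_cols <- c("ID", "Name", "OrbitingBody", "SentryObject")
data <- data[, !(names(data) %in% drop_cols)]

# 2. Encode the Boolean output as 0/1 (works for logical or "True"/"False" values)
data$Hazardous <- as.integer(data$Hazardous %in% c(TRUE, "True"))
```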
One of the most important steps, after understanding the meaning of the features and a first round of data cleaning, is Data Visualization.
The first thing I want to observe is the distribution of hazardous and non-hazardous asteroids across the two labels.
We can observe that only approximately one asteroid in ten is potentially hazardous.
So now we can move on to the observation of these asteroids.
The first purpose of the visualization is to compare the features of the dangerous objects with those of the non-dangerous ones.
As shown in the plot above, the risky asteroids are more likely to be "fast": the faster an asteroid, the higher its potential risk.
Looking at the maximum diameter of the asteroids, we can notice that the risky ones tend to have a larger diameter, which implies a higher danger. The curves have been approximated with a spline method.
The magnitude clearly seems to be a very significant variable for the analysis: the hazardous asteroids have a definitely smaller magnitude.
It means that they are brighter than the non-risky ones, and this brightness goes together with peril.
Looking at the boxplot of the distances of the objects, we can notice that the maximum distance is about the same for both hazardous and non-hazardous asteroids: this is because we are only considering asteroids that pass close to the Earth, so they are never farther than approximately 75 million kilometers.
Looking instead at the overall distribution and at the minimum distance, the hazardous ones are clearly closer to the Earth.
From the correlation matrix we can observe that MinDiameter and MaxDiameter have the highest possible correlation. This is because they carry the same information, and keeping both may be useless for the analysis.
This part will be reviewed in the feature engineering.
AbsoluteMagnitude is negatively correlated with the diameter: the larger an object, the brighter it tends to be, and so the lower its magnitude.
Merging of redundant variables: MinDiameter and MaxDiameter represent the same quantity (the diameter of the asteroid). Keeping both may cause redundancy and unnecessary computations.
Given that these two measures are hypothetical, I will merge them into a single variable: HypotheticalDiameter.
Standardization of the variables: some variables live on very different scales, so I need to standardize them in order to increase the efficiency of the algorithms. This step is done after the data visualization (a sketch of both steps follows the list below).
The two main reasons why it's important to standardize these variables are:
-standardization increases the efficiency of the subsequent algorithms;
-the variables have different units of measure.
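A sketch of both feature-engineering steps (averaging the two diameters and using `scale()` are assumptions about the exact choices made in the original analysis):

```r
# Merge the two diameter estimates into a single variable
# (taking their mean is an assumption; another rule could be used)
data$HypotheticalDiameter <- (data$MinDiameter + data$MaxDiameter) / 2
data$MinDiameter <- NULL
data$MaxDiameter <- NULL

# Standardize the numeric predictors (zero mean, unit variance)
num_cols <- c("RelativeVelocity", "MissDistance",
              "AbsoluteMagnitude", "HypotheticalDiameter")
data[num_cols] <- scale(data[num_cols])
```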
| RelativeVelocity | MissDistance | AbsoluteMagnitude | HypotheticalDiameter | Hazardous |
|---|---|---|---|---|
| -1.364 | 0.795 | -2.349 | 3.587 | 0 |
| 1.009 | 1.090 | -1.219 | 0.464 | 1 |
| 2.617 | 0.570 | -1.969 | 1.992 | 0 |
| -0.921 | -0.520 | -0.459 | -0.104 | 0 |
| -0.211 | 0.412 | -1.188 | 0.427 | 1 |
The purpose of the analysis is to forecast whether an asteroid is potentially hazardous for our planet or not.
Before studying the distributions, the division into train and test sets is fundamental. I randomly split the dataset into an \(80\%\) train set and a \(20\%\) test set.
The class proportions are approximately the same in the two sets, so I can proceed with the analysis.
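A minimal sketch of the 80/20 split (the seed value and the use of `sample()` are assumptions):

```r
set.seed(123)  # seed value is an assumption
n <- nrow(data)
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
train <- data[train_idx, ]
test  <- data[-train_idx, ]

# Check that the class proportions are comparable in the two sets
prop.table(table(train$Hazardous))
prop.table(table(test$Hazardous))
```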
The best way to predict the hazard of an asteroid is to apply the binary classification algorithms: the first approach is the implementation of a probit model.
The model takes the following form:
\[P(Y=1|X)=\Phi(X^T\beta)\]
Where \(P\) denotes the probability and \(\Phi\) denotes the cumulative distribution function of the standard Normal distribution (\(N(0,1)\)), which is used to model the regression function. The parameters \(\beta\), instead, have to be estimated.
It is possible to motivate the probit model as a latent variable model. Suppose there exists an auxiliary random variable:
\[Y^*=X^T\beta + \epsilon\] where \(\epsilon \sim N(0,1)\), then \(Y\) can be viewed as an indicator for whether the latent variable is positive.
\[ Y = \begin{cases} 1 & \text{if } Y^* > 0 \\ 0 & \text{otherwise} \end{cases} \]
Gibbs sampling of a probit model is possible because regression models typically use normal prior distributions over the weights, and this distribution is conjugate with the normal distribution of the errors (and hence of the latent variables \(Y^*\)), although that explicit data-augmentation scheme is not used here. The model can be specified as
\[\beta_i \sim N(0, 10^4)\]
for \(i=0,\dots,4\) (in the JAGS code below, dnorm(0, 0.0001) is a Normal with precision \(0.0001\), i.e. variance \(10^4\)), and
\[ Y_i \sim Ber(p_i), \qquad probit(p_i) = \beta_0 + \beta_1x_{1_i} + \beta_2x_{2_i} + \beta_3x_{3_i} + \beta_4x_{4_i} \]
N = length(train$Hazardous)
probit_reg <- function(){
# Likelihood
for (i in 1:N){
y[i] ~ dbern(p[i])
probit(p[i]) <- beta0 + beta1*x1[i] + beta2*x2[i] + beta3*x3[i] + beta4*x4[i]
}
#Prior beta parameters
beta0 ~ dnorm(0, 0.0001)
beta1 ~ dnorm(0, 0.0001)
beta2 ~ dnorm(0, 0.0001)
beta3 ~ dnorm(0, 0.0001)
beta4 ~ dnorm(0, 0.0001)
}
set.seed(123)
model1 = jags(data = data_sim,
model.file = probit_reg,
parameters.to.save = params,
n.chains = 2,
n.iter = 10000,
n.burnin = 1000,
n.thin = 10)
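The call above assumes that `data_sim` (the data passed to JAGS) and `params` (the parameters to monitor) are already defined. A possible construction is sketched below; the mapping of \(x_1,\dots,x_4\) to the standardized predictors follows the column order used later in the prediction step and is an assumption:

```r
library(R2jags)  # jags() above comes from the R2jags package

# Data passed to JAGS (the column-to-x mapping is an assumption)
data_sim <- list(
  y  = train$Hazardous,
  x1 = train$RelativeVelocity,
  x2 = train$MissDistance,
  x3 = train$AbsoluteMagnitude,
  x4 = train$HypotheticalDiameter,
  N  = N
)

# Parameters to monitor (the deviance is added automatically by R2jags)
params <- c("beta0", "beta1", "beta2", "beta3", "beta4")
```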
| | μ.vect | σ.vect | Quantile_0.25 | Quantile_0.50 | Quantile_0.75 | R.hat | N_eff |
|---|---|---|---|---|---|---|---|
| β_0 | -2.148 | 0.043 | -2.163 | -2.150 | -2.138 | 1.001 | 1800 |
| β_1 | 0.116 | 0.008 | 0.111 | 0.116 | 0.121 | 1.001 | 1800 |
| β_2 | -0.234 | 0.011 | -0.241 | -0.234 | -0.227 | 1.004 | 450 |
| β_3 | -1.626 | 0.052 | -1.645 | -1.629 | -1.612 | 1.001 | 1800 |
| β_4 | -0.638 | 0.026 | -0.650 | -0.640 | -0.629 | 1.001 | 1800 |
| Deviance | 32127.261 | 319.914 | 32113.733 | 32115.356 | 32117.769 | 1.000 | 1800 |
DIC = 83328.3 \(p_D\) = 51201.0
The values shown in "\(\mu.vect\)" are the posterior means, i.e. the MCMC approximations of the parameters.
Instead, "\(\sigma.vect\)" represents the standard deviation and can be interpreted as the posterior uncertainty of the parameters: \(\beta_3\) has in fact the highest uncertainty, while \(\beta_1\) has the lowest.
The statistic \(\hat R\) tells me whether the parameters have converged: the closer \(\hat R\) is to 1, the more plausible convergence is.
The DIC (Deviance Information Criterion) is an asymptotic approximation valid as the sample size becomes large (it is the Bayesian generalization of the frequentist AIC). It is defined as
\[DIC = D_{\hat \theta}(y)+2p_D = \hat D_{avg}(y)+p_D = 2\hat D_{avg}(y)-D_{\hat \theta}(y)\] where
\(D_{\hat\theta}(y)=-2\log f(y\mid\hat\theta(y))\),
\(\hat D_{avg}(y) \approx \frac{1}{M}\sum_{j=1}^{M} -2\log f(y\mid\theta^{(j)})\),
and \(p_D = \hat D_{avg}(y)-D_{\hat\theta}(y)\) is the effective number of parameters.
The model with the minimum DIC is the "best" one, in the same spirit as AIC.
The DIC is a comparative index, so I will compare it with all the models in order to choose the best one.
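With R2jags, the DIC and the effective number of parameters of a fitted model can be read directly from the output object; a small sketch (assuming `model1` is the object fitted above):

```r
# DIC and effective number of parameters as reported by R2jags
model1$BUGSoutput$DIC
model1$BUGSoutput$pD

# Posterior mean deviance (Dbar), from the monitored deviance samples
mean(model1$BUGSoutput$sims.list$deviance)
```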
Let's now observe the stationarity of the chains.
The traceplot shows the value taken by each parameter at every iteration of the Markov Chains.
From the plot we can see that the processes look stationary: the chains fluctuate around a constant level, without trends.
If \(I=E[h(\theta)]<\infty\) and \(\theta_1,\dots,\theta_T\) are (approximately) iid draws from the posterior, then by the Strong Law of Large Numbers (SLLN) the empirical average is a consistent estimator of \(I\): \[ \hat{I}_{T} = \frac{1}{T} \sum_{t=1}^{T} h(\theta_t)\xrightarrow{\text{a.s.}} I \] So I implement the estimator in the following way and apply it to every parameter.
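A minimal sketch of this running-mean (ergodic average) computation, applied to the posterior draws stored by R2jags (the use of `sims.matrix` and the plotting details are assumptions about the original implementation):

```r
# Posterior samples: rows = iterations (both chains), columns = parameters
sims <- model1$BUGSoutput$sims.matrix

# Running (ergodic) average: I_hat_T = (1/T) * sum_{t<=T} h(theta_t), with h = identity
running_mean <- function(x) cumsum(x) / seq_along(x)

# Example: empirical average of beta1 along the iterations
plot(running_mean(sims[, "beta1"]), type = "l",
     xlab = "Iteration", ylab = "Empirical average of beta1")
```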
Each parameter's empirical average converges to the same point: even with different initial values for the two chains, the estimated posterior mean is approximately the same.
From the density plot we can observe the distribution of the estimated parameters.
The mean of each distribution is the point estimate of the parameter ("\(\mu.vect\)"). The higher the curve at a certain value, the more plausible that value is for the parameter. As we can observe, the distributions are mainly concentrated around the estimated parameters, but another way to study the probability of the parameters is through Credible Intervals.
There are two types of credible intervals: Equal-Tail and Highest Posterior Density (HPD).
If the distribution is not unimodal and symmetric, there could be points outside the Equal-Tail interval with a higher posterior density than some points inside it. So in this case it is better to study the Highest Posterior Density interval (HPD).
The defining property of the HPD interval is that the posterior density at every point of the credible region \(I_{\alpha}\) is higher than the posterior density at any point outside of this set.
The HPD intervals at level \(\alpha = 0.05\) are the following
| | lower | upper |
|---|---|---|
| beta0 | -2.2048209 | -2.1278060 |
| beta1 | 0.1027309 | 0.1314259 |
| beta2 | -0.2548592 | -0.2171067 |
| beta3 | -1.7178587 | -1.6179294 |
| beta4 | -0.7133528 | -0.6489877 |
| deviance | 32236.4756045 | 32248.1679345 |
The intervals effectively contain the values of the estimated parameters.
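These HPD intervals can be obtained, for instance, with the `HPDinterval` function of the coda package (a sketch; converting the R2jags output with `as.mcmc` is an assumption about how the chains are handled):

```r
library(coda)

# Convert the R2jags output to an mcmc.list and compute 95% HPD intervals (per chain)
mcmc_chains <- as.mcmc(model1)
HPDinterval(mcmc_chains, prob = 0.95)
```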
The autocorrelation is the Pearson correlation between values of the process at different times, as a function of the two times or of the time lag.
| | beta0 | beta1 | beta2 | beta3 | beta4 | deviance |
|---|---|---|---|---|---|---|
| Lag 0 | 1.0000000 | 1.0000000 | 1.0000000 | 1.0000000 | 1.0000000 | 1.0000000 |
| Lag 10 | 0.3612700 | 0.0001370 | 0.0623235 | 0.3980965 | 0.3595515 | 0.0438963 |
| Lag 50 | 0.0332511 | -0.0127798 | 0.0135061 | 0.0364965 | 0.0278903 | 0.0001390 |
| Lag 100 | -0.0152833 | 0.0222933 | -0.0405888 | -0.0126048 | -0.0118249 | 0.0002038 |
| Lag 500 | 0.0117260 | -0.0211545 | 0.0179715 | 0.0066646 | 0.0034373 | 0.0000932 |
From the ACF plots and the table we observe that, as the lag increases, the autocorrelation decreases towards zero. It means that there is essentially no correlation between distant values of the chain.
We can observe that the autocorrelation of \(\beta_1\) and \(\beta_2\) decreases faster than that of the other parameters.
Above we have seen the behavior of the Markov Chains from different points of view.
Furthermore, looking closely at the graphs, we can notice that \(\beta_0,\beta_3\) and \(\beta_4\) follow very similar patterns and have similar distributions.
These aspects suggest looking at the correlation between them, to see whether these values are high:
As expected, there is high correlation between them!
To evaluate the approximation error I use the MCSE: the MCSE (Monte Carlo Standard Error) is an estimate of the inaccuracy of Monte Carlo samples, usually regarding the expectation of posterior samples from Markov Chain Monte Carlo algorithms.
I estimated the MCSE with the \(MCSE\) function from the LaplacesDemon package, obtaining the following results:
| | Approximation_Error |
|---|---|
| β_0 | 0.0030455 |
| β_1 | 0.0002415 |
| β_2 | 0.0004255 |
| β_3 | 0.0037533 |
| β_4 | 0.0018643 |
The highest approximation error is the \(\beta_3\)’s, and the lowest is the \(\beta_1\)’s.
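A sketch of the MCSE computation with LaplacesDemon (taking the posterior draws from the R2jags simulation matrix is an assumption):

```r
library(LaplacesDemon)

# Monte Carlo Standard Error of each regression coefficient
sims  <- model1$BUGSoutput$sims.matrix
betas <- c("beta0", "beta1", "beta2", "beta3", "beta4")
sapply(betas, function(b) MCSE(sims[, b]))
```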
Now that the parameters have been estimated, it’s time to do the first forecast on the test set:
# Posterior means of the parameters
beta0 = model1$BUGSoutput$summary["beta0", "mean"]
beta1 = model1$BUGSoutput$summary["beta1", "mean"]
beta2 = model1$BUGSoutput$summary["beta2", "mean"]
beta3 = model1$BUGSoutput$summary["beta3", "mean"]
beta4 = model1$BUGSoutput$summary["beta4", "mean"]

# Test-set predictors and response
x = test[1:4]
y = test[, 5]

# Linear predictor X^T beta
XTb = beta0 + beta1*x[1] + beta2*x[2] + beta3*x[3] + beta4*x[4]

# Predictions for a given threshold t
predictions_t = function(xTb, t){
  # Probit link: p_i = Phi(x_i^T beta)
  preds = pnorm(xTb[, 1])
  # Classify as hazardous when the estimated probability exceeds t
  predictions = ifelse(preds > t, 1, 0)
  return(predictions)
}
Before selecting the best threshold, it's good to observe the ROC curves for the different thresholds, in order to choose the one with the best performance for the problem.
In fact, the main objective of the task is to detect as many hazardous asteroids as possible, increasing the True Positive Rate (TPR/Recall) and the Precision and avoiding false negatives (decreasing the FNR), even at the cost of accepting some extra false positives (a higher FPR).
It's better to know which objects are going to collide with the Earth than to know which ones are not dangerous!
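The ROC curve on the test set can be drawn, for example, with the pROC package (the package choice is an assumption; the original analysis may use a different tool):

```r
library(pROC)

# Posterior-mean probit probabilities on the test set
probs <- pnorm(XTb[, 1])

roc_probit <- roc(response = y, predictor = probs)
plot(roc_probit)   # ROC curve
auc(roc_probit)    # area under the curve
```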
The best threshold for this task is \(t=0.35\).
In fact it gives us the following values:
The confusion matrix for the threshold is the following
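A minimal sketch of how the confusion matrix and the derived indices can be computed at the chosen threshold, reusing `predictions_t`, `XTb` and `y` defined above (the explicit formulas are the standard definitions):

```r
# Predictions at the chosen threshold
pred <- predictions_t(XTb, t = 0.35)

# Confusion matrix: rows = predicted class, columns = true class
conf <- table(Predicted = factor(pred, levels = c(0, 1)),
              Actual    = factor(y,    levels = c(0, 1)))
conf

TP <- conf["1", "1"]; FP <- conf["1", "0"]
FN <- conf["0", "1"]; TN <- conf["0", "0"]

accuracy  <- (TP + TN) / sum(conf)
precision <- TP / (TP + FP)
recall    <- TP / (TP + FN)
```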
1. The Geweke Diagnostic is based on a test for equality of the means of the first and last part of a Markov Chain (by default the first 10% and the last 50%).
The null hypothesis is that the two parts of the chain come from the same distribution, and it is assessed with a difference-of-means test.
If the two averages are equal, it is likely that the chain has converged.
The test statistic is a standard Z-score: the difference between the two sample means divided by its estimated standard error, adjusted for autocorrelation.
| | Z.score_Chain1 | Z.score_Chain2 |
|---|---|---|
| β_0 | 1.377 | 0.9585 |
| β_1 | -1.303 | -1.9106 |
| β_2 | 1.315 | 0.8069 |
| β_3 | 1.325 | 0.9575 |
| β_4 | 1.412 | 1.0057 |
| deviance | 1.061 | 1.0609 |
The plot shows that almost all the evaluated Z-scores fall inside the acceptance region, so we do not reject the null hypothesis.
2. The Raftery & Lewis Diagnostic estimates the number of iterations needed to reach a given level of precision in the posterior samples.
What we want to estimate is some posterior quantile of interest \(q\).
If we define some acceptable tolerance \(r\) for \(q\) and a probability \(s\) of being within that tolerance, the Raftery and Lewis diagnostic will calculate the number of iterations \(N = T_{min}\) and the number of burn-in samples \(M\) necessary to satisfy the specified conditions.
The results are:
Quantile = 0.025
Accuracy = +/- 0.005
Probability of Accuracy = 0.95
3746 samples are needed for the values above.
3. Heidelberger & Welch Diagnostic is a convergence test that test whether the sampled values come from a stationary distribution.
It’s applied firstly to the whole chain, then discards \(10\%\) of the chain until the null hypothesis is discarded. If the \(50\%\) is discarded, we reject the null hypothesis.
| | Stationarity_test_chain1 | Start_iteration_chain1 | p.value_chain1 | Stationarity_test_chain2 | Start_iteration_chain2 | p.value_chain2 |
|---|---|---|---|---|---|---|
| β_0 | passed | 91 | 0.454 | passed | 91 | 0.1845 |
| β_1 | passed | 1 | 0.733 | passed | 1 | 0.1113 |
| β_2 | passed | 1 | 0.203 | passed | 1 | 0.3172 |
| β_3 | passed | 91 | 0.233 | passed | 181 | 0.0939 |
| β_4 | passed | 91 | 0.207 | passed | 91 | 0.8046 |
| deviance | passed | 1 | 0.577 | passed | 1 | 0.6052 |
| | Halfwidth_test_chain1 | Mean_chain1 | Halfwidth_chain1 | Halfwidth_test_chain2 | Mean_chain2 | Halfwidth_chain2 |
|---|---|---|---|---|---|---|
| β_0 | passed | -2.167 | 2.68e-03 | passed | -2.169 | 2.78e-03 |
| β_1 | passed | 0.117 | 4.94e-04 | passed | 0.117 | 4.96e-04 |
| β_2 | passed | -0.235 | 7.69e-04 | passed | -0.236 | 8.05e-04 |
| β_3 | passed | -1.672 | 3.61e-03 | passed | -1.675 | 3.71e-03 |
| β_4 | passed | -0.682 | 1.98e-03 | passed | -0.684 | 1.94e-03 |
| deviance | passed | 32252.650 | 2.10e+01 | passed | 32252.524 | 2.10e+01 |
Every test has been passed successfully.
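All three diagnostics above are available in the coda package; a sketch of the corresponding calls (using the same quantile, accuracy and probability reported above, and assuming the chains are converted with `as.mcmc`):

```r
library(coda)

mcmc_chains <- as.mcmc(model1)

geweke.diag(mcmc_chains)                                    # 1. Geweke Z-scores
raftery.diag(mcmc_chains, q = 0.025, r = 0.005, s = 0.95)   # 2. Raftery & Lewis
heidel.diag(mcmc_chains)                                    # 3. Heidelberger & Welch
```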
After observing the behaviour and the goodness of the parameters, it's time to compare the obtained model with the parameters estimated under the frequentist approach.
In order to do that, I use the \(glm\) function from R.
freq_probit = glm(Hazardous ~ HypotheticalDiameter + RelativeVelocity + MissDistance + AbsoluteMagnitude, family = binomial(link = "probit"),data=train)
AIC = 32250
The estimated parameters are pretty close to the parameters studied in the Bayesian approach.
In order to compare the models, I use the same techniques used above (ROC curve + evaluation of the indices from the confusion matrix)
The best threshold is \(t=0.2\) that gives me the following results:
The confusion matrix for the threshold is the following
Another way to model binary outputs is Logistic Regression with the logit link function. The assumptions are the following:
\[Y_i \sim Ber(p_i)\] \[logit(p_i)=\log\bigg(\frac{p_i}{1-p_i}\bigg)=\beta_0 + \beta_1x_{1_i} + \beta_2x_{2_i}+\beta_3x_{3_i} + \beta_4x_{4_i}\]
Also in this case the parameters \(\beta\) receive a vague Normal prior with mean \(\mu = 0\) and precision \(0.0001\) (i.e. variance \(10^4\)):
\[\beta_i\sim N(0,10^4)\]
for \(i=0,1,2,3,4\).
As done before, the model is implemented with R2jags.
N = length(train$Hazardous)
log_reg <- function(){
# Likelihood
for (i in 1:N){
y[i] ~ dbern(p[i])
logit(p[i]) <- beta0 + beta1*x1[i] + beta2*x2[i] + beta3*x3[i] + beta4*x4[i]
}
#Prior beta parameters
beta0 ~ dnorm(0, 0.0001)
beta1 ~ dnorm(0, 0.0001)
beta2 ~ dnorm(0, 0.0001)
beta3 ~ dnorm(0, 0.0001)
beta4 ~ dnorm(0, 0.0001)
}
set.seed(123)
model2 = jags(data = data_sim,
model.file = log_reg,
parameters.to.save = params,
n.chains = 2,
n.iter = 10000,
n.burnin = 1000,
n.thin = 10)
| | μ.vect | σ.vect | Quantile_0.25 | Quantile_0.50 | Quantile_0.75 | R.hat | N_eff |
|---|---|---|---|---|---|---|---|
| β_0 | -3.980 | 0.102 | -4.012 | -3.984 | -3.959 | 1.000 | 1800 |
| β_1 | 0.198 | 0.013 | 0.189 | 0.199 | 0.208 | 1.002 | 1800 |
| β_2 | -0.455 | 0.021 | -0.467 | -0.456 | -0.444 | 1.001 | 1800 |
| β_3 | -3.315 | 0.123 | -3.356 | -3.321 | -3.287 | 1.000 | 1800 |
| β_4 | -1.528 | 0.073 | -1.560 | -1.531 | -1.503 | 1.001 | 1800 |
| Deviance | 32486.199 | 414.230 | 32468.279 | 32469.978 | 32472.291 | 1.000 | 1800 |
DIC = 118166.2
Also in this case there is convergence for the parameters, but they are larger in absolute value than the probit ones.
\(\beta_3\) and \(\beta_0\) have the highest posterior uncertainty.
The behavior of the parameters in the simulation is very similar to the probit case; the only difference is the mean of the parameters (and so the level around which the chains fluctuate).
Anyway, all the processes look stationary.
The empirical averages follow a pattern very similar to that of the probit parameters, with the difference that \(\beta_1\) has a cleaner behavior and converges faster to \(I\).
The posterior densities are, as before, concentrated around the estimated parameters and are not perfectly symmetric.
For the same reason as before, I prefer to use the HPD interval instead of EQ.
| | lower | upper |
|---|---|---|
| beta0 | -4.0993935 | -3.9340667 |
| beta1 | 0.1756789 | 0.2270450 |
| beta2 | -0.4898944 | -0.4209132 |
| beta3 | -3.4782841 | -3.2677423 |
| beta4 | -1.6909881 | -1.5244092 |
| deviance | 32494.8289698 | 32506.8282006 |
There is a difference between the probit parameters and the logit ones: \(\beta_0, \beta_3\) and \(\beta_4\) have a slower convergence and a higher autocorrelation.
| | beta0 | beta1 | beta2 | beta3 | beta4 | deviance |
|---|---|---|---|---|---|---|
| Lag 0 | 1.0000000 | 1.0000000 | 1.0000000 | 1.0000000 | 1.0000000 | 1.0000000 |
| Lag 10 | 0.5155044 | 0.0103044 | 0.1591531 | 0.5766383 | 0.5811649 | 0.0936733 |
| Lag 50 | 0.1216148 | 0.0073981 | 0.0264474 | 0.1434062 | 0.1595974 | 0.0031139 |
| Lag 100 | 0.0117419 | -0.0230559 | -0.0013087 | 0.0180058 | 0.0281247 | -0.0002063 |
| Lag 500 | 0.0109224 | -0.0083524 | 0.0005585 | 0.0103798 | 0.0069378 | -0.0003884 |
The correlation of \(\beta_0\) and \(\beta_3\) with the other parameters is higher than for the corresponding probit parameters.
The MCSEs for the logistic regression are the following.
| | Approximation_Error |
|---|---|
| β_0 | 0.0099490 |
| β_1 | 0.0004612 |
| β_2 | 0.0010909 |
| β_3 | 0.0127902 |
| β_4 | 0.0078298 |
The comparison of MCSEs is important because it measures the reliability of the parameter estimates across the different models. In this case, the probit parameters show lower approximation errors than the logit ones.
Now it’s time to evaluate the model from the point of view of the forecasting.
# Posterior means of the parameters
beta0 = model2$BUGSoutput$summary["beta0", "mean"]
beta1 = model2$BUGSoutput$summary["beta1", "mean"]
beta2 = model2$BUGSoutput$summary["beta2", "mean"]
beta3 = model2$BUGSoutput$summary["beta3", "mean"]
beta4 = model2$BUGSoutput$summary["beta4", "mean"]

# Test-set predictors and response
x = test[1:4]
y = test[, 5]

# Linear predictor (eta)
XTb = beta0 + beta1*x[1] + beta2*x[2] + beta3*x[3] + beta4*x[4]

# Predictions for a given threshold t
predictions_t = function(xTb, t){
  # Inverse logit link: p_i = 1 / (1 + exp(-eta_i))
  preds = 1 / (1 + exp(-xTb[, 1]))
  # Classify as hazardous when the estimated probability exceeds t
  predictions = ifelse(preds > t, 1, 0)
  return(predictions)
}
By looking at the ROC curves we can now choose the best threshold for the task:
The best threshold for this task is \(t=0.25\).
In fact it gives us the following values:
freq_logit = glm(Hazardous ~ HypotheticalDiameter + RelativeVelocity + MissDistance + AbsoluteMagnitude, family = binomial(link = "logit"),data=train)
AIC = 32500
Also in this case the best threshold is \(t=0.25\), giving the following values:
Unlike the logit and probit models, the complementary log-log (cloglog) model is asymmetric, and it is frequently used when the probability of an event is very small or very large.
The assumptions in this case are
\[ Y_i \sim Ber(p_i) \]
\[ cloglog(p_i) = \log(-\log(1-p_i)) = \beta_0 + \beta_1x_{1_i} + \beta_2x_{2_i} +\beta_3x_{3_i} +\beta_4 x_{4_i} \]
And the prior distribution of the \(\beta\) parameters is again the vague Normal
\[ \beta_i \sim N(0,10^4) \]
for \(i=0,\dots,4\).
N = length(train$Hazardous)
cloglog <- function(){
# Likelihood
for (i in 1:N){
y[i] ~ dbern(p[i])
cloglog(p[i]) <- beta0 + beta1*x1[i] + beta2*x2[i] + beta3*x3[i] + beta4*x4[i]
}
beta0 ~ dnorm(0, 0.0001)
beta1 ~ dnorm(0, 0.0001)
beta2 ~ dnorm(0, 0.0001)
beta3 ~ dnorm(0, 0.0001)
beta4 ~ dnorm(0, 0.0001)
}
set.seed(123)
model3 = jags(data = data_sim,
model.file = cloglog,
parameters.to.save = params,
n.chains = 2,
n.iter = 10000,
n.burnin = 1000,
n.thin = 10)
| | μ.vect | σ.vect | Quantile_0.25 | Quantile_0.50 | Quantile_0.75 | R.hat | N_eff |
|---|---|---|---|---|---|---|---|
| β_0 | -3.920 | 0.036 | -3.943 | -3.920 | -3.896 | 1.000 | 1800 |
| β_1 | 0.158 | 0.011 | 0.150 | 0.158 | 0.165 | 1.002 | 810 |
| β_2 | -0.404 | 0.015 | -0.414 | -0.405 | -0.395 | 1.002 | 1800 |
| β_3 | -3.090 | 0.046 | -3.120 | -3.090 | -3.059 | 1.001 | 1800 |
| β_4 | -1.540 | 0.038 | -1.565 | -1.540 | -1.515 | 1.001 | 1800 |
| Deviance | 32693.754 | 3.075 | 32691.502 | 32693.086 | 32695.484 | 1.000 | 1 |
DIC = 32698.5
As far as posterior uncertainty is concerned, it is pretty similar to the previous models, but the interesting behavior of this chain concerns the diagnostics:
From the plots the chains look stationary.
The moving-mean plots are less smooth than the previous ones; in fact convergence arrives later, especially for the second chain (the dark-green one).
Even if the distributions may look somewhat symmetric, they are not.
So I evaluate the HPD intervals as done before
| | lower | upper |
|---|---|---|
| beta0 | -3.9981032 | -3.8552132 |
| beta1 | 0.1356655 | 0.1781634 |
| beta2 | -0.4303237 | -0.3777290 |
| beta3 | -3.1740849 | -2.9889132 |
| beta4 | -1.6182374 | -1.4711771 |
| deviance | 32689.0941526 | 32699.0877019 |
| | beta0 | beta1 | beta2 | beta3 | beta4 | deviance |
|---|---|---|---|---|---|---|
| Lag 0 | 1.0000000 | 1.0000000 | 1.0000000 | 1.0000000 | 1.0000000 | 1.0000000 |
| Lag 10 | 0.5945371 | 0.0123484 | 0.0192054 | 0.7165430 | 0.5516016 | 0.0889108 |
| Lag 50 | 0.1612906 | -0.0195376 | 0.0092633 | 0.1670784 | 0.1241136 | 0.0025466 |
| Lag 100 | 0.0148105 | 0.0266597 | 0.0005488 | 0.0118273 | 0.0022424 | 0.0123025 |
| Lag 500 | -0.0123488 | 0.0017639 | -0.0158581 | 0.0014514 | 0.0277657 | -0.0230362 |
The autocorrelation plots are pretty similar to the logit ones: the correlation between values of the chain at different lags decays in a similar way.
In this case the parameters are less correlated with each other: \(\beta_0\) is less correlated with \(\beta_2\) and with the deviance, and the overall correlation is lower than before for every parameter.
In this case the MCSEs are higher than the probit ones, but lower than some of the approximation errors of the logit.
| | Approximation_Error |
|---|---|
| β_0 | 0.0035653 |
| β_1 | 0.0004490 |
| β_2 | 0.0005642 |
| β_3 | 0.0048462 |
| β_4 | 0.0035437 |
\(\beta_3\) has the biggest approximation error and \(\beta_1\) has the lowest.
# Posterior means of the parameters from the cloglog model
beta0 = model3$BUGSoutput$summary["beta0", "mean"]
beta1 = model3$BUGSoutput$summary["beta1", "mean"]
beta2 = model3$BUGSoutput$summary["beta2", "mean"]
beta3 = model3$BUGSoutput$summary["beta3", "mean"]
beta4 = model3$BUGSoutput$summary["beta4", "mean"]

# Linear predictor on the test set
XTb = beta0 + beta1*x[1] + beta2*x[2] + beta3*x[3] + beta4*x[4]

# Predictions for a given threshold t
predictions_t = function(xTb, t){
  # Inverse cloglog link: p_i = 1 - exp(-exp(eta_i))
  preds = 1 - exp(-exp(xTb[, 1]))
  # Classify as hazardous when the estimated probability exceeds t
  predictions = ifelse(preds > t, 1, 0)
  return(predictions)
}
For the best threshold \(t=0.3\), the values obtained are
The final comparison is with the frequentist cloglog regression.
train$Hazardous = as.numeric(train$Hazardous)
freq_cloglog = glm(Hazardous ~ HypotheticalDiameter + RelativeVelocity + MissDistance + AbsoluteMagnitude, family = binomial(link = "cloglog"),data=train)
AIC = 32700
For the threshold \(t=0.3\) we have:
It's time to choose the best model among the six analyzed. The most important thing to learn from this analysis is whether an asteroid is potentially harmful for our planet.
To complete this task, the best model is the one with the highest precision (the percentage of true positives over the predicted positives) and the highest recall (the percentage of true positives over all the actual positives).
It's better to detect as many true positives as possible, sacrificing some false positives, than to miss asteroids that are very dangerous!
Thanks to the frequentist approach, I discovered that every parameter is highly significant (***) for the model.
Now let's look at the models with their optimal thresholds, in order to choose the best one for this problem.
| | BestThreshold | Accuracy | Precision | Recall | AUROC | F1_score |
|---|---|---|---|---|---|---|
| Probit Bayesian | 0.35 | 0.878 | 0.941 | 0.924 | 0.614 | 0.933 |
| Probit Frequentist | 0.25 | 0.875 | 0.935 | 0.927 | 0.628 | 0.931 |
| Logit Bayesian | 0.25 | 0.847 | 0.881 | 0.945 | 0.705 | 0.912 |
| Logit Frequentist | 0.25 | 0.862 | 0.910 | 0.935 | 0.661 | 0.922 |
| Cloglog Bayesian | 0.30 | 0.851 | 0.889 | 0.942 | 0.691 | 0.915 |
| Cloglog Frequentist | 0.30 | 0.868 | 0.921 | 0.932 | 0.649 | 0.926 |
Given the optimal thresholds for our problem, the best model for the prediction is the Bayesian Probit model.
In fact this model guarantees high performance in detecting true positives while avoiding false negatives, and it has the highest overall Accuracy and Precision, so it is the best one.
| | Bayesian_DIC | Frequentist_AIC |
|---|---|---|
| Probit | 83328.3 | 32250 |
| Logit | 118166.2 | 32500 |
| Cloglog | 32698.5 | 32700 |
DIC and AIC are information criteria: the lower the criterion, the better the model is expected to be.
In this case the best Bayesian model would be the Cloglog, and the best frequentist one would be the Probit.
Even if these information criteria suggest choosing a different model, my predictions suggest choosing the Bayesian Probit model for this problem.
In conclusion, all the models fit the data well.
The models differ only slightly, but for this kind of analysis the best model is the probit one.
NASA - Nearest Object: https://www.kaggle.com/datasets/sameepvani/nasa-nearest-earth-objects
Gibbs Sampling for the Probit Regression Model: https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.154.268&rep=rep1&type=pdf